Name | Version | Summary | Date |
html-to-markdown | 1.13.0 | A modern, type-safe Python library for converting HTML to Markdown with comprehensive tag support and customizable options | 2025-09-16 05:35:37 |
kreuzberg | 3.15.0 | Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats | 2025-09-14 18:14:57 |
docstrange | 1.1.6 | Extract and Convert PDF, Word, PowerPoint, Excel, images, URLs into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR. | 2025-09-10 09:27:30 |
mseep-kreuzberg | 3.13.4 | Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats | 2025-09-09 03:44:56 |
mcp-pdf | 1.0.1 | Secure FastMCP server for comprehensive PDF processing - text extraction, OCR, table extraction, forms, annotations, and more | 2025-09-07 07:00:52 |
docuglean-ocr | 1.0.0 | An SDK for intelligent document processing using SOTA VLLM models | 2025-09-02 13:19:12 |
pyrtex | 0.2.1 | A Python library for batch text extraction and processing using Google Cloud Vertex AI | 2025-08-31 14:08:52 |
img2text-cli | 0.1.7 | A CLI tool to extract text from images using OCR | 2025-08-31 11:21:20 |
wizarddocx | 1.0.0 | Text extraction from Microsoft Word files. Parses Word documents natively and can optionally run local OCR with Tesseract for embedded images or scanned pages. Supports page selection and bytes input. Legacy .doc is read-only and OCR is not available. | 2025-08-28 09:27:49 |
iflow-mcp_langextract-mcp | 0.1.1 | FastMCP server for Google's langextract library - extract structured information from unstructured text using LLMs | 2025-08-26 08:42:35 |
mcp-gosling | 0.1.0 | MCP Gosling - Advanced document processing server for Goose AI using IBM's Docling library | 2025-08-25 02:12:32 |
ocr-detection | 0.4.1 | A Python library to detect whether PDF pages contain extractable text or are scanned images requiring OCR | 2025-08-22 07:27:10 |
upspawn-ocr-cli | 0.1.0b3 | Modern, polished CLI to extract text from PDFs using the Mistral OCR API. | 2025-08-15 23:24:29 |
hashub-docapp | 1.0.0 | Professional Python SDK for the HashubDocApp API - Advanced OCR, document conversion, and text extraction service | 2025-08-15 12:09:58 |
extract-hwp | 0.1.0 | Python library for extracting text from Korean HWP files (HWP 5.0 and HWPX formats) | 2025-08-13 10:13:03 |
pdf-tools-mcp | 0.1.4 | A FastMCP-based PDF reading and manipulation tool server | 2025-08-10 17:21:33 |
document-data-extractor | 1.0.4 | Best open-source document to markdown extractor for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract | 2025-07-29 08:25:56 |
llm-data-converter | 2.2.0 | Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract | 2025-07-25 13:32:07 |
pdfhandleretc | 0.1.1 | Lightweight command-line and Python API toolkit for PDF text extraction, encryption, permissions, and more. | 2025-07-16 04:04:16 |
pdf-ocr-processor | 2.0.3 | Advanced PDF OCR processing with AI-powered text extraction and selectable text overlays | 2025-07-11 21:11:24 |